1. Introduction

This report summarizes the findings of our research supported by the
U. S. Army Research Office through grant DA-ARO-D-31-124-73-G157. Most
of the results reported here are published in the technical papers, Ph.D.
dissertations and M.S. theses given in the References. Eleven graduate
students have been supported by the grant to pursue advanced degrees at
the University of California, Berkeley.

This investigation has been an attempt to establish some foundation
for solving design and operational problems related to computer architectures
which utilize concurrency in computation to attain more computing power
than is allowed by today's electronic technology. The emphasis of our
research under the grant has been on pipeline architectures [Ram 77]. But
in the course of our research, some new results in parallel processing and
memory organization were developed. Parallel processing complements
pipelining in increasing the capability of a processing system, and the
memory bottleneck must be solved if all the advantages of pipelining and
parallel processing are to be realized.

We shall briefly discuss our findings in these research areas in
three separate sections, one for each topic. A summary section is given
at the end.

2. Pipeline Systems

2.1. Overview

Pipelining can be viewed as a mode of exploiting computational
parallelism. It can be roughly defined as a technique of embedding
concurrency in execution by constructing a system configuration composed
of independent autonomous units, each dedicated to performing a specific
subfunction in an overlapped mode with the others [Ram 74b]. Such an
autonomous unit is sometimes called a pipeline segment or facility. A
task, once initiated, flows from segment to segment inside the pipe to be
processed. As an illustration, the schematic of a pipelined instruction
unit is shown in Fig. 1.


[Fig. 1. An example pipelined instruction unit: Instruction Fetch (600 nsec),
Decode (200 nsec), Operand Fetch (600 nsec), Execution (400 nsec).]


It is easy to see that if an uninterrupted stream of instructions is
allowed to flow through the pipeline shown in Fig. 1, the throughput rate
of the system will be one instruction per 600 nanoseconds, the time taken
by the slowest segment. The basic objective of pipelining is to maximize
throughput through efficient utilization of resources. Compared with
parallel processing, pipelining is not geared to shorten response time,
but it increases system throughput rate in a more economical way.
Furthermore, the effect of overhead on system throughput rate in operating
a pipeline system is almost invisible, mainly because this overhead can
also be overlapped with other useful operations. Many modern computers
(e.g. IBM 360/195, Amdahl 470 and the Cray computers) have extensively
utilized pipelining in their processor architecture.
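
The throughput figure quoted for Fig. 1 can be checked with a short
calculation. The sketch below is only an illustration, assuming an ideal
linear pipeline with no stalls and no intermediate buffering; the function
names are hypothetical and do not come from the report.

```python
# Segment times from Fig. 1, in nanoseconds.
STAGES_NS = {
    "instruction_fetch": 600,
    "decode": 200,
    "operand_fetch": 600,
    "execution": 400,
}


def pipeline_metrics(stage_times_ns):
    """Return (latency, initiation_interval, speedup) for an ideal linear pipe."""
    latency = sum(stage_times_ns)    # one instruction, end to end
    interval = max(stage_times_ns)   # the slowest segment limits the rate
    return latency, interval, latency / interval


def total_time_ns(n_instructions, stage_times_ns):
    """Time to finish n instructions fed as an uninterrupted stream."""
    latency, interval, _ = pipeline_metrics(stage_times_ns)
    return latency + (n_instructions - 1) * interval


if __name__ == "__main__":
    times = list(STAGES_NS.values())
    print(pipeline_metrics(times))    # (1800, 600, 3.0)
    print(total_time_ns(100, times))  # 61200
```

Under these assumptions the response time of a single instruction stays
at 1800 nsec, while the steady-state rate is one instruction per 600 nsec.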

The structure of a pipeline system can be quite complicated. A
segment may be used at different times to complete the processing of a
task (see Fig. 2). Sometimes there may be several pipelines in the system
sharing certain facilities. The TI ASC arithmetic unit shown in Fig. 3
is a notable example of such systems. To complicate the issues further,
the tasks to be processed may have different processing and resource
requirements. The tasks may require different amounts of time to be processed
in a segment; they may be tightly coupled by some precedence relations; and
if not appropriately controlled they may race for the service of shared
segments and resources (e.g. registers). The difficulty of designing and
controlling the operation of a pipeline system most effectively has presented
a very challenging area of fruitful research. We have extensively surveyed
existing theoretical results and practices of pipelining in [Ram 77a].

[Fig. 3. Pipelined arithmetic unit in the TI ASC.]
[Figure caption: An example of a multifunction pipeline.]

The results we obtained in tackling some problems in pipelining can
be grouped into four areas: (1) modeling of pipeline systems for effective
control and performance evaluation; (2) sequencing strategies and their
inherent complexity; (3) design methods to optimize cost-effectiveness; and
(4) program restructuring and optimal register assignment for pipelined
processors. These results are summarized in the following four sections.

2.2. Modeling of pipeline systems

Our earlier pipelining studies aimed at establishing a unified
analytical model for the design and control of pipeline
systems. In [Red 72], we developed a scheme for categorizing various
types of pipeline systems based on the flow pattern of tasks in the
pipeline, the amount of intermediate storage and the nature of the tasks
(e.g. absence or presence of dependence). Subsequently, a study was made
of the various scheduling algorithms for each category [Red 73].

In [Ram 74a], reconfigurable shared resource pipeline (RSRP) systems
were considered in full generality. A useful efficiency measure which
considers cost, speed and space-time span of the segments of a pipeline
system was established so that the effectiveness and feasibility of a
pipeline system can be analyzed. An RSRP system can be modeled by a digraph
where a node represents a segment and an arc indicates the flow of control
from one segment to the other. A collision matrix was introduced in
[Ram 74a]. It can be used to avoid collisions inside an RSRP system. (A
collision occurs when more than one task attempts to use a shared segment
at the same time.) Figure 4 shows an RSRP system with its corresponding
collision matrix. The (i,j)th entry of the matrix gives the time intervals,
measured from the initiation of a task through pipe i, during which a task
can be initiated through pipe j without causing a collision.

[Fig. 4. Example collision matrix. Pipes: P1: 1-2-3-4; P2: 1-5-3-6 (the
speed of each facility is labeled in the figure). Collision matrix entries:
t_11 = (15, ∞); t_12 = ((4,10), (16, ∞)); t_21 and t_22 are illegible.]
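
To make the idea of a collision matrix concrete, the following sketch
computes collision-free initiation delays for a two-pipe system with the
same segment sequences as Fig. 4. It is a simplified illustration, not the
formulation of [Ram 74a]: it assumes each task holds a segment for a fixed
time and moves on without buffering, and the segment times below are
hypothetical values chosen for the example (they are not the speeds
labeled in the figure).

```python
from math import inf

# Each pipe is a list of (segment, holding_time); times are assumed values.
P1 = [(1, 2), (2, 8), (3, 3), (4, 1)]
P2 = [(1, 2), (5, 1), (3, 3), (6, 1)]


def occupancy(pipe):
    """Map segment -> list of [start, end) intervals for a task started at 0."""
    t, occ = 0, {}
    for segment, duration in pipe:
        occ.setdefault(segment, []).append((t, t + duration))
        t += duration
    return occ


def forbidden_delays(pipe_i, pipe_j):
    """Open intervals of delay d for which a j-task started d after an
    i-task overlaps it on some shared segment."""
    occ_i, occ_j = occupancy(pipe_i), occupancy(pipe_j)
    intervals = []
    for segment in occ_i.keys() & occ_j.keys():
        for a1, b1 in occ_i[segment]:
            for a2, b2 in occ_j[segment]:
                # [a1, b1) and [d + a2, d + b2) overlap iff a1-b2 < d < b1-a2
                intervals.append((a1 - b2, b1 - a2))
    return sorted(intervals)


def allowed_delays(pipe_i, pipe_j):
    """Closed intervals of delay d >= 0 that cause no collision."""
    allowed, cursor = [], 0
    for lo, hi in forbidden_delays(pipe_i, pipe_j):
        if hi <= cursor:
            continue                  # already covered or entirely negative
        if lo >= cursor:
            allowed.append((cursor, lo))
        cursor = hi
    allowed.append((cursor, inf))
    return allowed


if __name__ == "__main__":
    pipes = {"P1": P1, "P2": P2}
    for i, pi in pipes.items():
        for j, pj in pipes.items():
            print(i, j, allowed_delays(pi, pj))
```

With these assumed times, a P2 task may follow a P1 task after a delay
between 2 and 4 time units or after 10 or more, which has the same shape
as the interval lists shown in the figure.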

Concurrently with the research reported above, a scheme called the
dynamic sequencing and segmentation model (DSSM) was developed with the
objective of concealing the run-time overhead required to control the
operation of a general parallel and pipeline system [Ram 74b]. A
simulation study [Ram 73] showed that the DSSM delivers steady, high
performance and is highly feasible to implement.

2.3. Sequencing in pipeline systems

In general, the major obstacles to high efficiency in pipeline systems
are caused by: (1) the inherent relationship between the tasks; (2) the
scarcity of some expensive facilities; (3) the unequal and sometimes
unpredictable execution time of the tasks in a given segment; and (4) the
conflicts at some shared resource in the system. These factors indicate
one critical problem to be solved: the optimal sequencing of the tasks to
be admitted into the pipeline system. A sequencing strategy is optimal if
maximum throughput is achieved while collisions on shared resources are
avoided.


For simple RSRP systems with a bounded queue of ready tasks, an
efficient lookahead scheme to produce locally optimal sequences was
developed in [Ram 74a]. The method can be extended to the dynamic case
where huge tasks are involved, for example, in a computer network. The
inherent complexity of sequencing tasks in an RSRP system was studied in
[Ram 75b]. It was found that even under very strong assumptions, when we
have a static set of ready tasks and the execution times of the pipeline
segments are deterministic, the optimal sequencing problem of a pipeline
system is NP-complete. This implies that its complexity is in the same
class as the classical traveling salesman problem. This result justifies
the semi-exhaustive nature of an optimal strategy for pipeline system
sequencing. One implication is that for low level pipeline implementations,
faster heuristics are necessary. Some efficient heuristics were proposed
in [Li 75] and simulation experiments have demonstrated their effectiveness.
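
The trade-off between exhaustive search and fast heuristics can be
illustrated with a toy model. The sketch below is not one of the
algorithms of [Ram 74a] or [Li 75]; it assumes, for simplicity, that only
the minimum delay between consecutive initiations matters, and the delay
table LATENCY and the task types "A", "B", "C" are hypothetical.

```python
from itertools import permutations

# Assumed minimum initiation delays: LATENCY[a][b] is the earliest time
# after starting a type-a task at which a type-b task may be started.
LATENCY = {
    "A": {"A": 8, "B": 2, "C": 5},
    "B": {"A": 2, "B": 3, "C": 4},
    "C": {"A": 6, "B": 1, "C": 7},
}


def span(seq):
    """Time from the first initiation to the last one in the sequence."""
    return sum(LATENCY[a][b] for a, b in zip(seq, seq[1:]))


def optimal_sequence(tasks):
    """Exhaustive search over all distinct orderings (factorial cost)."""
    return min(sorted(set(permutations(tasks))), key=span)


def greedy_sequence(tasks):
    """Heuristic: always admit the ready task with the smallest delay."""
    remaining = sorted(tasks)
    seq = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda t: LATENCY[seq[-1]][t])
        remaining.remove(nxt)
        seq.append(nxt)
    return seq


if __name__ == "__main__":
    ready = ["A", "A", "B", "C"]
    best, fast = optimal_sequence(ready), greedy_sequence(ready)
    print(best, span(best))   # ('A', 'C', 'B', 'A') 8
    print(fast, span(fast))   # ['A', 'B', 'A', 'C'] 9
```

In this small instance the greedy order is close to, but not equal to,
the optimum, while the exhaustive search grows factorially with the number
of ready tasks.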

2.4. Design methods

The design problem can be informally stated as follows: given
an application and its throughput, reliability, cost and other requirements,
how can a pipeline system be designed most cost-effectively for the
application? In [Li 75], we developed a set of algorithms so that the design
problem can be approached in a systematic way rather than by pure instinct,
experience and ad hoc solutions. The design of a pipeline system can be
approached in the following manner. First, a basic skeleton machine and
some relevant cost and effect functions are derived based on the system
and application objectives. A set of semi-dynamic programming strategies
developed in [Li 75] can then be applied to obtain analytically the most
cost-effective pipeline design without exhaustive enumeration. Further,
by appropriate duplication of some shared resources, a complex RSRP system
can be partitioned into simpler systems. Appropriate partitioning leads
to reduced control complexity and improved throughput and reliability.
The partitioning algorithms are extended from the algorithms of
Kernighan-Lin and Ford-Fulkerson. Details can be found in [Li 75]. Another
problem of interest is how to introduce redundancy into an RSRP system to
improve reliability under cost constraints. A semi-dynamic programming
algorithm for this problem under weak assumptions was also developed in
[Li 75].
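
As an illustration of how redundancy can be allocated under a cost
constraint, the sketch below uses a standard textbook dynamic program. It
is not the semi-dynamic programming algorithm of [Li 75]; it assumes a
series arrangement of segments, independent failures, integer unit costs,
and that k parallel copies of a segment with reliability r give
reliability 1 - (1 - r)^k. All names and numbers are hypothetical.

```python
def allocate_redundancy(segments, budget):
    """segments: list of (unit_reliability, unit_cost), integer costs.

    Returns (system_reliability, copies_per_segment) maximizing the product
    of segment reliabilities with total cost <= budget and at least one
    copy of every segment.
    """
    # best maps cost spent so far -> (reliability, copies) for the prefix.
    best = {0: (1.0, [])}
    for r, c in segments:
        nxt = {}
        for spent, (rel, copies) in best.items():
            k = 1
            while spent + k * c <= budget:
                cand = rel * (1 - (1 - r) ** k)
                cost = spent + k * c
                if cost not in nxt or cand > nxt[cost][0]:
                    nxt[cost] = (cand, copies + [k])
                k += 1
        best = nxt
    if not best:
        raise ValueError("budget too small for one copy of every segment")
    return max(best.values(), key=lambda entry: entry[0])


if __name__ == "__main__":
    rel, copies = allocate_redundancy([(0.9, 2), (0.8, 3)], budget=10)
    print(round(rel, 4), copies)   # 0.9504 [2, 2]
```

The table is indexed by cost already spent, so the work grows with the
number of segments times the budget rather than with the number of
possible configurations.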

It should be mentioned here that these analytical design algorithms
are developed not for decision making, but as a tool to test decisions
and justify predictions or experience. In all cases it is recommended
that both analytical and simulation evaluations be used to search for
a good design.

2.5. Program restructuring and register allocation

The characteristics of a program have significant influence on the
effectiveness and applicability of pipelining. In general, a program
must possess suitable structures as well as abundant parallelism in order
to fully utilize the multiple pipelines available. Restructuring a program
is the process of organizing the original program in such a way that
the final code has "good" characteristics with respect to the system
configuration available. These characteristics would simplify the control
procedure and thus improve the performance of the system. In general,
how best to restructure a program is a difficult problem, and it is
intimately related to the pipeline configuration and the sequencing scheme
used. Some efficient program restructuring strategies were developed in
[Li 75], and they were studied, using simulation experiments, in the
context of some sequencing heuristics on systems whose models were based on
some existing machines. (They include the STAR 100 and TI ASC pipes.)

A closely related problem is how to assign pipeline resources to tasks
most effectively. We looked into the special case of assigning
registers to instructions to be executed in a pipeline processor. In
a pipeline processing system, the operand fetch and preparation phase
of one instruction can be overlapped with the actual execution phase of
some preceding instructions. While the latter time varies from
instruction to instruction, the former is also variable, depending on
the register assignment. It is crucial for pipeline processors to
employ a register assignment that takes the processor architecture into
account; otherwise the overlapping capability can be severely degraded.
This problem is formulated in detail in [Li 77]. In general the problem
is also inherently difficult. A non-exhaustive optimal algorithm is found
under some strong assumptions. Efficient heuristics with certain
performance bounds are proposed for other cases.

3.  Parallel  Processing 

The term parallel processing can be defined as the mode of operation
in which different sections of a program are processed by several units
of a multi-processing system. The fundamental objective of parallel
processing is to execute a program as fast as possible, often regardless
of the cost incurred. Although parallel processing is quite
different from pipelining in the means used to achieve high computing
power, these two techniques complement each other in improving the
performance of a processing system. Modern computers, almost without
exception, utilize both techniques in their architecture. The parallel
execution unit of the instruction pipeline of the IBM 360/91 is a notable
example.

Under this grant, we have considered two different approaches
to the identification of parallelism in a program. A set of language
constructs was developed by which the programmer can explicitly indicate
parallelism. At the execution level, a scheme was also developed to
detect and control parallelism at run time.

In [Ram 75a], control parts of parallel programming constructs,
one at the machine level and the other at the source level, were defined.
With respect to these constructs, a technical foundation is established
for detecting useful parallelism hidden in a source parallel program
as well as for restructuring a program into one lending itself to
easier analysis and more effective execution. Another objective of
these language constructs is to impose constraints on the program structure
so that program reliability can be improved. Details of these
constructs can be found in [Kim 74].

More recently, we developed a scheme to detect and control the
execution of parallel tasks at run time [Ram 76]. Unlike the
traditional lookahead approach, parallelism detection in the new scheme is
done by controls local to the processors. With information about how
variables are used in the program and the status of tasks being executed,
these local controls coordinate with each other and synchronize the
action of the processors so that precedence relations among tasks are
preserved. The scheme minimizes the overhead time before parallel execution
and thus in effect removes the parallelism detection procedure as a
bottleneck of the parallel processing system.
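
How variable usage determines whether two tasks may run in parallel can
be illustrated with the well-known Bernstein conditions. The sketch below
is not the run-time scheme of [Ram 76]; it only shows the kind of
read-set and write-set test such a control could rely on, and the task
descriptions are hypothetical.

```python
def can_run_in_parallel(task_a, task_b):
    """Bernstein's conditions on (reads, writes) pairs of variable sets.

    Two tasks are independent if neither writes what the other reads
    and they do not write the same variable.
    """
    reads_a, writes_a = task_a
    reads_b, writes_b = task_b
    conflict = (writes_a & reads_b) | (reads_a & writes_b) | (writes_a & writes_b)
    return not conflict


# Hypothetical tasks, described by the variables they read and write.
T1 = ({"a", "b"}, {"c"})   # c = a + b
T2 = ({"a"}, {"d"})        # d = a * 2
T3 = ({"c", "d"}, {"e"})   # e = c + d

if __name__ == "__main__":
    print(can_run_in_parallel(T1, T2))   # True: only a shared read of a
    print(can_run_in_parallel(T1, T3))   # False: T3 reads c written by T1
    print(can_run_in_parallel(T2, T3))   # False: T3 reads d written by T2
```

Here T1 and T2 can be started together, while T3 must wait for both, which
is the precedence a run-time control has to preserve.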

Associative search methods are also studied. An algorithm for ordered
retrieval was developed in [Ram 77b]. It is believed that the algorithm is
the best one ever presented.

4.  Memory  Organization 

The  problem  of  memory  contention  has  a significant  effect  on  the 
efficiency  of  any  pipeline  and/or  parallel  system.  This  problem  occurs 
when  more  than  one  task  being  executed  concurrently  needs  to  access  the 
same  memory  module. 
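
The effect of contention can be estimated with a classical
back-of-the-envelope model. The sketch below is not the Markov chain model
of [Wah 76]; it assumes that each of n concurrent requests addresses one
of m memory modules independently and uniformly at random, and that a
module serves one request per cycle, so the expected number of requests
served per cycle equals the expected number of distinct modules addressed.

```python
def expected_busy_modules(m, n):
    """Expected number of distinct modules hit by n uniform random requests
    over m modules: m * (1 - (1 - 1/m) ** n)."""
    if m <= 0 or n < 0:
        raise ValueError("need m > 0 and n >= 0")
    return m * (1 - (1 - 1 / m) ** n)


if __name__ == "__main__":
    for n in (1, 4, 8, 16):
        served = expected_busy_modules(8, n)
        print(n, round(served, 3))
```

With 8 modules and 8 simultaneous requests, only about 5.25 requests are
served per cycle on average under these assumptions; the remainder are
delayed by conflicts.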

A scheme using intelligent buffers was developed for an interleaved
memory in [Wah 76]. The goal there is to improve the performance of the
memory by the addition of a small number of buffers. An analytical
model based on discrete Markov chains has been developed to evaluate
the scheme. The results show that significant improvement can be achieved
with a small number of buffers. The analytical result is verified
by a trace-driven simulation.

5.  Summary 

This report summarizes the research findings obtained under grant
DA-ARO-D-31-124-73-G157. Three related areas of advanced computer
architecture are investigated. In pipelining, results obtained pertain
to the modeling, sequencing control, design methods and task resource
allocation problems of generalized pipeline systems. In parallel processing,
language constructs which can effectively identify parallel tasks and
a new scheme to detect parallelism at run time are developed. An efficient
associative search algorithm for ordered retrieval is also proposed.
Finally, a scheme using intelligent buffers to improve the performance of
interleaved memory is developed and analyzed.